Red Wine Exploration by Darui Zhang
What properties make a good red wine? In this project we try to answer this question by exploring the red wine data set.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Univariate Plots Section
Feature Names and Summary
This red wine data set contains 1,599 observations. Each has 11 variables describing the chemical properties of the wine, plus an index column (X) and a quality score.
##        X          fixed.acidity    volatile.acidity   citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar      chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH           sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol          quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
Quality Distribution
The wine quality grade is a discrete number ranging from 3 to 8, with a median of 6.

Distribution of Other Chemical Properties

Univariate Analysis
Some observations on the distributions of the chemical properties can be made:
Normal: Volatile acidity, Density, pH
Positively Skewed: Fixed acidity, Citric acid, Free sulfur dioxide, Total sulfur dioxide, Sulphates, Alcohol
Long Tail: Residual sugar, Chlorides
Rescale Variable
Skewed and long-tailed data can be transformed toward a more normal distribution by taking the square root or the logarithm. Taking sulphates as an example, we compare the original feature with its square root and its log.

Both the square root and the log help transform the feature toward a normal distribution. Of the two, the log-scaled feature is closer to normal.
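The comparison can also be made numerically with a skewness statistic. The sketch below is an illustration only: the `skewness` helper is our own, and the `sulphates` vector is a hypothetical log-normal stand-in for `redwine$sulphates` so that the code runs without the data file.

```r
# Moment-based sample skewness (0 for a perfectly symmetric sample).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical stand-in for redwine$sulphates: evenly spaced quantiles of a
# log-normal distribution (an assumption, not the real data).
sulphates <- qlnorm(ppoints(1599), meanlog = log(0.62), sdlog = 0.5)

# Skewness of the original feature and of its two transformations.
skew <- c(
  original = skewness(sulphates),
  sqrt     = skewness(sqrt(sulphates)),
  log      = skewness(log(sulphates))
)
print(round(skew, 3))
```

On the real data, the same three numbers would show how far each version of the feature is from symmetric. The closer the value is to 0, the more symmetric the distribution.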
Bivariate Plots Section
Bivariate Plots Selection
A plot matrix was used to get a first glance at the data. We are interested in the correlation between wine quality and each chemical property.

The top 4 factors correlated with wine quality (those with an absolute correlation coefficient greater than 0.2) are:
| Feature | Correlation with quality |
| --- | --- |
| alcohol | 0.476 |
| volatile.acidity | -0.391 |
| sulphates | 0.251 |
| citric.acid | 0.226 |
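A ranking like this can be produced by correlating every feature with quality and sorting by absolute value. The following is a minimal sketch, not the code used for the report: `rank_correlations` is a hypothetical helper, and `toy` is a small made-up data frame that only borrows column names from the data set.

```r
# Correlate every column except `target` with `target`,
# then sort by absolute correlation, strongest first.
rank_correlations <- function(df, target) {
  features <- setdiff(names(df), target)
  r <- sapply(features, function(f) cor(df[[f]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Made-up values for illustration only (not taken from the wine data).
toy <- data.frame(
  quality          = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol          = c(9.0, 9.5, 9.8, 10.0, 10.5, 11.0, 12.0, 13.0),
  volatile.acidity = c(1.00, 0.90, 0.60, 0.70, 0.50, 0.55, 0.40, 0.30),
  residual.sugar   = c(2.0, 2.5, 1.9, 2.6, 2.0, 2.4, 2.1, 2.3)
)

ranked <- rank_correlations(toy, "quality")
print(round(ranked, 3))

# Keep only features whose absolute correlation exceeds the 0.2 cutoff.
strong <- ranked[abs(ranked) > 0.2]
```

Sorting on the absolute value matters here: a strongly negative feature such as volatile acidity would otherwise be ranked last.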
Bivariate Analysis
Alcohol content has the strongest correlation with wine quality. The scatter plot of alcohol and wine quality is shown below.

The original plot looks overplotted, so we add an alpha value and lines for the 0.1, 0.5 and 0.9 quantiles to show the general trend.

In this plot, the trend of wine quality increasing with alcohol content can be clearly observed.
Distribution Analysis
In this analysis, we try to find out whether the distributions of the chemical properties differ across wine grades.

Note that since the number of observations differs between quality grades, the distributions of the higher and lower grades are hard to see.
A normalized plot is shown below.

The plot looks a little busy. We group the grades in pairs (grades 3 and 4 as “Low”, grades 5 and 6 as “Medium”, grades 7 and 8 as “High”) and plot again.

The new plot looks cleaner.
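The grouping and the per-grade normalization can be sketched in base R with `cut()` and `prop.table()`. The `quality` and `alcohol` vectors below are made-up values for illustration, and the alcohol bin edges are an arbitrary assumption.

```r
# Made-up quality scores and alcohol levels (not the real data).
quality <- c(3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8)
alcohol <- c(9.0, 9.4, 9.5, 9.8, 10.0, 9.6, 10.2, 10.5, 11.0, 11.5, 12.0, 12.8)

# Intervals are (2,4], (4,6] and (6,8], so 3-4 -> Low, 5-6 -> Medium, 7-8 -> High.
grade <- cut(quality, breaks = c(2, 4, 6, 8),
             labels = c("Low", "Medium", "High"))
counts <- table(grade)
print(counts)

# Normalize within each grade so that groups of different size are comparable:
# every row of `within_grade` sums to 1.
alcohol_bin <- cut(alcohol, breaks = c(8, 10, 12, 15))
within_grade <- prop.table(table(grade, alcohol_bin), margin = 1)
print(round(within_grade, 2))
```

Normalizing by row is what makes the small Low and High groups visible next to the much larger Medium group.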
A similar analysis was done on the 3 other factors: volatile acidity, sulphates and citric acid.


As noted in the univariate analysis, the sulphates data is skewed, so we tried both the original and the log scale of the feature.


The log-scaled feature looks more spread out and is therefore preferable.
Correlation Between Features
There is an interesting correlation between two of the main features: volatile acidity and citric acid.

##
## Pearson's product-moment correlation
##
## data: redwine$volatile.acidity and redwine$citric.acid
## t = -26.4891, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
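As a consistency check, the t statistic and the confidence interval in this output can be reproduced from the correlation estimate and the sample size alone. The sketch below uses the standard formulas (t from r with n - 2 degrees of freedom, and the Fisher z-transform for the interval). The variable names are ours.

```r
r <- -0.5524957   # sample correlation reported above
n <- 1599         # number of observations
dof <- n - 2      # degrees of freedom, 1597 as in the output

# t statistic for the null hypothesis of zero correlation.
t_stat <- r * sqrt(dof / (1 - r^2))

# 95% confidence interval via the Fisher z-transform.
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

print(round(t_stat, 4))
print(round(ci, 4))
```

Both results agree with the reported values (t = -26.4891, interval from -0.5857 to -0.5175) up to rounding.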
Multivariate Plots Section
Main Chemical Property vs Wine Quality
With different colors, we can add another dimension to the plot. There are 4 main features. Alcohol and volatile acidity are the two factors most strongly correlated with wine quality.

The figure looks overplotted, since the wine quality grades are discrete numbers. We can use a jitter plot to alleviate this problem.

We can see that higher quality wines have higher alcohol and lower volatile acidity.
Add Another Feature
Now we add a third feature, the log scale of sulphates, and use a separate facet for each wine grade.

We can see that higher quality wines have higher alcohol (x-axis), lower volatile acidity (y-axis) and higher sulphates (hue).
Main Chemical Properties vs Wine Quality
Since we can visualize only 3 dimensions at a time, including wine quality, two graphs are needed to visualize the 4 main chemical properties.

The same trend in the effect of alcohol and volatile acidity on wine quality can be observed.

We can see that higher quality wines have higher sulphates (x-axis) and higher citric acid (y-axis).
Linear Multivariable Model
A multivariable linear model was created to predict wine quality from the chemical properties.
The features were added incrementally, roughly in order of the strength of their correlation with wine quality.
##
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = redwine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = redwine)
## m3: lm(formula = quality ~ volatile.acidity + alcohol + sulphates,
## data = redwine)
## m4: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid, data = redwine)
## m5: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides, data = redwine)
## m6: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides + total.sulfur.dioxide, data = redwine)
## m7: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides + total.sulfur.dioxide + density,
## data = redwine)
##
## ==========================================================================================================
##                                 m1          m2          m3          m4          m5          m6          m7
## ----------------------------------------------------------------------------------------------------------
## (Intercept)               6.566***    3.095***    2.611***    2.646***    2.769***    2.985***      -0.953
##                            (0.058)     (0.184)     (0.196)     (0.201)     (0.202)     (0.206)    (11.990)
## volatile.acidity         -1.761***   -1.384***   -1.221***   -1.265***   -1.155***   -1.104***   -1.114***
##                            (0.104)     (0.095)     (0.097)     (0.113)     (0.115)     (0.115)     (0.120)
## alcohol                                0.314***    0.309***    0.309***    0.292***    0.276***    0.280***
##                                        (0.016)     (0.016)     (0.016)     (0.016)     (0.017)     (0.020)
## sulphates                                          0.679***    0.696***    0.871***    0.908***    0.903***
##                                                    (0.101)     (0.103)     (0.111)     (0.111)     (0.112)
## citric.acid                                                      -0.079       0.021       0.065       0.044
##                                                                (0.104)     (0.106)     (0.106)     (0.124)
## chlorides                                                                 -1.663***   -1.763***   -1.747***
##                                                                            (0.405)     (0.403)     (0.406)
## total.sulfur.dioxide                                                                  -0.002***   -0.002***
##                                                                                        (0.001)     (0.001)
## density                                                                                              3.923
##                                                                                                   (11.944)
## ----------------------------------------------------------------------------------------------------------
## R-squared                    0.153       0.317       0.336       0.336       0.343       0.352       0.352
## adj. R-squared               0.152       0.316       0.335       0.334       0.341       0.349       0.349
## sigma                        0.744       0.668       0.659       0.659       0.656       0.651       0.652
## F                          287.444     370.379     268.912     201.777     166.407     143.910     123.298
## p                            0.000       0.000       0.000       0.000       0.000       0.000       0.000
## Log-likelihood           -1794.312   -1621.814   -1599.384   -1599.093   -1590.662   -1580.192   -1580.138
## Deviance                   883.198     711.796     692.105     691.852     684.595     675.689     675.643
## AIC                       3594.624    3251.628    3208.768    3210.186    3195.324    3176.384    3178.276
## BIC                       3610.756    3273.136    3235.654    3242.448    3232.964    3219.401    3226.670
## N                             1599        1599        1599        1599        1599        1599        1599
## ==========================================================================================================
The model with 6 features (m6) has the lowest AIC (Akaike information criterion) value, 3176.384. Adding a seventh feature (density) raises the AIC to 3178.276. The intercept also changes dramatically, from 2.985 to -0.953 with a standard error of 11.990, which is a sign of overfitting.
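The AIC and BIC rows of the table follow directly from the log-likelihood row. The sketch below recomputes them, assuming the usual convention for `lm` fits in which the parameter count k is the number of coefficients plus one for the residual variance. The vector names are ours, and the inputs are copied from the table.

```r
# Log-likelihoods copied from the model table above.
log_lik <- c(m1 = -1794.312, m2 = -1621.814, m3 = -1599.384, m4 = -1599.093,
             m5 = -1590.662, m6 = -1580.192, m7 = -1580.138)

# Coefficients per model: intercept plus 1 to 7 predictors.
n_coef <- c(m1 = 2, m2 = 3, m3 = 4, m4 = 5, m5 = 6, m6 = 7, m7 = 8)
k <- n_coef + 1   # plus one parameter for the residual variance
n <- 1599         # number of observations

aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + k * log(n)

print(round(aic, 3))
print(round(bic, 3))

# The model with the smallest AIC.
best <- names(which.min(aic))
print(best)
```

The recomputed values match the table, and m6 comes out lowest on both criteria. The log-likelihood barely improves from m6 to m7 (from -1580.192 to -1580.138), so the penalty for the extra parameter outweighs the gain.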
The model can be described as:
wine_quality = 2.985 + 0.276 * alcohol - 1.104 * volatile.acidity + 0.908 * sulphates + 0.065 * citric.acid - 1.763 * chlorides - 0.002 * total.sulfur.dioxide
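As a worked example, the m6 coefficients from the table above can be applied to a single wine. The sketch below uses the median values from the summary table as a hypothetical "typical" wine. `coef_m6` and `predict_quality` are illustrative names, and the coefficients are the rounded values from the table, so the result is approximate.

```r
# Coefficients of model m6, copied from the model table (rounded to 3 decimals).
coef_m6 <- c("(Intercept)"        = 2.985,
             alcohol              = 0.276,
             volatile.acidity     = -1.104,
             sulphates            = 0.908,
             citric.acid          = 0.065,
             chlorides            = -1.763,
             total.sulfur.dioxide = -0.002)

# Linear prediction: intercept plus the sum of coefficient * feature value.
predict_quality <- function(wine, coefs = coef_m6) {
  predictors <- setdiff(names(coefs), "(Intercept)")
  coefs[["(Intercept)"]] + sum(coefs[predictors] * wine[predictors])
}

# Hypothetical wine built from the medians in the summary table.
median_wine <- c(alcohol = 10.2, volatile.acidity = 0.52, sulphates = 0.62,
                 citric.acid = 0.26, chlorides = 0.079,
                 total.sulfur.dioxide = 38)

predicted <- predict_quality(median_wine)
print(round(predicted, 2))  # about 5.59
```

The predicted score of about 5.59 sits close to the mean quality of 5.636, which is what one would expect for a wine with typical values on every feature.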
Final Plots and Summary
Plot One

Description One
Plot one shows the distribution of wine quality. Note that the dataset is unbalanced: it has many observations for medium quality (grades 5 and 6), but far fewer for low (grades 3 and 4) and high (grades 7 and 8) quality wine.
Plot Two

Description Two
The 4 features with the highest correlation coefficients are alcohol, volatile acidity, sulphates and citric acid. Wine quality is grouped into low (3, 4), medium (5, 6) and high (7, 8). High quality wines have a high alcohol level; however, there is no significant difference between medium and low quality wines. Volatile acidity decreases as wine quality increases. Sulphates and citric acid increase as wine quality increases.
Plot Three

Description Three
The 4 features are also represented in the scatter plots. Two features are plotted at a time, with color indicating wine quality. A similar trend to the previous figure can be observed. In general, high quality wines tend to have higher alcohol and lower volatile acidity. They also tend to have higher sulphates and higher citric acid content.
Reflection
The red wine dataset contains 1,599 observations with 11 variables on the chemical properties. We are interested in the correlation between these features and wine quality. Unlike diamond prices, which are dominated by size (carat), wine quality is more complex and has no obvious driver. Most of the data visualization in this project was done on the 4 features with the highest correlation coefficients: alcohol (0.476), volatile acidity (-0.391), sulphates (0.251) and citric acid (0.226). After some web research, our reflections on these chemical components are as follows.
Alcohol: surprisingly or not, alcohol is the No. 1 factor correlated with wine quality. The data strongly suggest that the higher the alcohol content, the more likely the wine is to be of better quality. One suggestion is that wines with higher alcohol are made from riper grapes, which tend to have a more intense flavor. Therefore, the relation between alcohol and wine quality is more likely to be correlation than causation. There is also controversy about alcohol level. One article even says “high alcohol is a wine fault not a badge of honor”. [1][2]
Volatile acidity: volatile acidity has a negative correlation with wine quality. Volatile acidity can contribute to acidic tastes, which are often considered a wine fault.[3]
Sulphates: sulphates have a positive correlation with wine quality. They are often added by winemakers to prevent spoilage. It is less likely that sulphates themselves contribute to better taste or aroma. Their presence simply means the wine is less likely to be spoiled.[4]
Citric acid: unlike volatile acidity, citric acid has a positive correlation with wine quality. Winemakers often add citric acid to give a “fresh” taste. However, it can also bring unwanted effects through bacterial metabolism.[5]
Surprisingly, other chemical properties, such as residual sugar and pH, do not have a strong correlation with wine quality.
In the end, a linear model of 6 features was created to predict wine quality. However, wine quality is a complex subject. The type of grape can strongly affect the taste of the wine. There are many nuances in taste and aroma that cannot be captured by examining the chemical components. The linear model is an overly simplified model. Good wine is more than a perfect combination of chemical components.
Future improvement can be made if more data can be collected on both low-quality and high-quality wine. The dataset is highly unbalanced: it has many data points for medium quality wine (grades 5 and 6), but far fewer for low quality (grades 3 and 4) and high quality (grades 7 and 8) wine. If the data set had more records at both the low end and the high end, the quality of the analysis could be improved, and we could be more certain about whether there is a significant correlation between a chemical component and wine quality.